20 research outputs found

    Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks

    Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain, viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to this problem typically separate the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via a deep convolutional neural network that operates directly on the image pixels. We employ the DistBelief implementation of deep neural networks to train large, distributed neural networks on high-quality images. We find that the performance of this approach increases with the depth of the convolutional network, with the best performance occurring in the deepest architecture we trained, with eleven hidden layers. We evaluate this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing complete street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art, achieving 97.84% accuracy. We also evaluate this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over 90% accuracy. To further explore the applicability of the proposed system to broader text recognition tasks, we apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the most secure reverse Turing tests that use distorted text to distinguish humans from bots. We report a 99.8% accuracy on the hardest category of reCAPTCHA. Our evaluations on both tasks indicate that at specific operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.
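The unified model described in this abstract predicts the sequence length and each digit jointly. A minimal sketch of that factorized prediction, assuming one softmax head for the length and one per digit position (the head shapes and function name here are illustrative, not the paper's actual code):

```python
import numpy as np

def predict_number(length_probs, digit_probs):
    """Combine a length head and per-position digit heads into one street number.

    length_probs: shape (max_len + 1,), softmax over sequence length 0..max_len
    digit_probs:  shape (max_len, 10), softmax over digits for each position
    Returns (number_string, joint_probability) for the most likely sequence,
    using the factorization P(S|X) = P(L|X) * prod_i P(S_i|X).
    """
    best_digits = digit_probs.argmax(axis=1)            # greedy per-position argmax
    best_digit_p = digit_probs.max(axis=1)
    best = ("", 0.0)
    for length, p_len in enumerate(length_probs):
        joint = p_len * np.prod(best_digit_p[:length])  # product over used positions
        if joint > best[1]:
            best = ("".join(map(str, best_digits[:length])), joint)
    return best
```

Because the per-position digit distributions are conditionally independent given the image, taking the argmax at each position is exact for every fixed length, so only the length needs to be searched.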

    Bilattice based Logical Reasoning for Automated Visual Surveillance and other Applications

    The primary objective of an automated visual surveillance system is to observe and understand human behavior and report unusual or potentially dangerous activities/events in a timely manner. Automatically understanding human behavior from visual input, however, is a challenging task. The research presented in this thesis focuses on designing a reasoning framework that can combine, in a principled manner, high-level contextual information with low-level image processing primitives to interpret visual information. The primary motivation for this work has been to design a reasoning framework that draws heavily upon human-like reasoning and reasons explicitly about visual as well as non-visual information to solve classification problems. Humans are adept at performing inference under uncertainty by combining evidence from multiple, noisy, and often contradictory sources. This thesis describes a logical reasoning approach in which logical rules encode high-level knowledge about the world and logical facts serve as input to the system from real-world observations. The reasoning framework supports encoding of multiple rules for the same proposition, representing multiple lines of reasoning, and also supports encoding of rules that infer explicit negation and thereby potentially contradictory information. Uncertainties are associated both with the logical rules that guide reasoning and with the input facts. This framework has been applied to visual surveillance problems such as human activity recognition, identity maintenance, and human detection. Finally, we have also applied it to the problem of collaborative filtering to predict movie ratings by explicitly reasoning about users' preferences.
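The bilattice idea above can be illustrated on the unit square [0,1]^2, where each proposition carries a pair (evidence for, evidence against) and contradictory rule outputs coexist rather than cancel. The operator definitions below follow the standard square-bilattice construction and are a sketch, not necessarily the thesis's exact formulation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class B:
    """Truth value on the square bilattice [0,1]^2: (evidence for, evidence against)."""
    f: float  # degree of support for the proposition
    a: float  # degree of support against it

    def neg(self):            # negation swaps the two components
        return B(self.a, self.f)

    def join_k(self, other):  # knowledge join: accumulate evidence from both sources
        return B(max(self.f, other.f), max(self.a, other.a))

    def join_t(self, other):  # truth join (disjunction)
        return B(max(self.f, other.f), min(self.a, other.a))

    def meet_t(self, other):  # truth meet (conjunction)
        return B(min(self.f, other.f), max(self.a, other.a))

# Two rules fire for the same proposition: one supports it, one (a rule that
# infers explicit negation) contradicts it. The knowledge join keeps both.
rule1 = B(0.8, 0.0)
rule2 = B(0.0, 0.6)
combined = rule1.join_k(rule2)  # B(0.8, 0.6): strong but contradictory information
```

The point of the two orderings is that `join_k` measures how much is known (including contradictions), while `join_t`/`meet_t` measure how true a compound statement is.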

    Multi-cue exemplar-based nonparametric model for gesture recognition

    This paper presents an approach for a multi-cue, view-based recognition of gestures. We describe an exemplar-based technique that combines two different forms of exemplars, shape exemplars and motion exemplars, in a unified probabilistic framework. Each gesture is represented as a sequence of learned body poses as well as a sequence of learned motion parameters. The shape exemplars are comprised of pose contours, and the motion exemplars are represented as affine motion parameters extracted using a robust estimation approach. The probabilistic framework learns by employing a nonparametric estimation technique to model the exemplar distributions. It imposes temporal constraints between different exemplars through a learned Hidden Markov Model (HMM) for each gesture. We use the proposed multi-cue approach to recognize a set of fourteen gestures and contrast it against a shape-only, single-cue based system.
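The temporal-constraint step above can be sketched with a standard Viterbi decode over exemplar states. The transition, initial, and likelihood values in the usage below are toy numbers; in the paper the per-frame likelihoods would come from the nonparametric shape and motion exemplar densities:

```python
import numpy as np

def viterbi(log_trans, log_init, log_lik):
    """Most likely exemplar-state sequence under an HMM.

    log_trans: (S, S) log transition matrix between exemplar states
    log_init:  (S,)   log initial-state distribution
    log_lik:   (T, S) per-frame log likelihood of each exemplar
    """
    T, S = log_lik.shape
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]              # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A per-gesture HMM decoded this way turns independent per-frame exemplar matches into a temporally consistent pose sequence.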

    VidMAP: Video Monitoring of Activity with Prolog

    This paper describes the architecture of a visual surveillance system that combines real-time computer vision algorithms with logic programming to represent and recognize activities involving interactions amongst people, packages, and the environments through which they move. The low-level computer vision algorithms log primitive events of interest as observed facts, while the higher-level Prolog-based reasoning engine uses these facts in conjunction with predefined rules to recognize various activities in the input video streams. The system is illustrated in action on a multi-camera surveillance scenario that includes both security and safety violations.
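The fact/rule split described above can be mimicked in a few lines. The event names and the example rule here are invented for illustration, not taken from the actual system:

```python
# Primitive events logged by the vision layer, as timestamped facts.
facts = [
    ("enter", "person1", 10),
    ("drop",  "person1", "package1", 12),
    ("exit",  "person1", 15),
]

def unattended_package(facts):
    """Rule: a package is unattended if it was dropped and the dropper later exited."""
    drops = {(f[1], f[2]): f[3] for f in facts if f[0] == "drop"}   # (person, pkg) -> time
    exits = {f[1]: f[2] for f in facts if f[0] == "exit"}           # person -> time
    return [pkg for (person, pkg), t_drop in drops.items()
            if person in exits and exits[person] > t_drop]
```

In the actual system this rule would be a Prolog clause over asserted facts; the structure is the same: low-level detections become ground facts, and higher-level activities are defined declaratively over them.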

    Person Re-identification using Semantic Color Names and RankBoost

    We address the problem of appearance-based person re-identification, which has been drawing an increasing amount of attention in computer vision. It is a very challenging task since the visual appearance of a person can change dramatically due to different backgrounds, camera characteristics, lighting conditions, viewpoints, and human poses. Among the recent studies on person re-id, color information plays a major role in terms of performance. Traditional color features such as color histograms, however, leave much room for improvement. We propose to apply semantic color names to describe a person image, and compute probability distributions over those basic color terms as image descriptors. To be better combined with other features, we define our appearance affinity model as a linear combination of similarity measurements of corresponding local descriptors, and apply the RankBoost algorithm to find the optimal weights for the similarity measurements. We evaluate our proposed system on the highly challenging VIPeR dataset, and show improvements over the state-of-the-art methods in terms of widely used person re-id evaluation metrics.
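A minimal version of the color-name descriptor might look like the following. The actual paper uses a learned probabilistic mapping from pixels to color-name terms, whereas this sketch hard-assigns each pixel to its nearest RGB prototype; the prototype values are illustrative assumptions:

```python
import numpy as np

# Eleven basic color terms with rough RGB prototypes (illustrative values).
COLOR_NAMES = ["black", "blue", "brown", "gray", "green", "orange",
               "pink", "purple", "red", "white", "yellow"]
PROTOTYPES = np.array([
    [0, 0, 0], [0, 0, 255], [139, 69, 19], [128, 128, 128],
    [0, 128, 0], [255, 165, 0], [255, 192, 203], [128, 0, 128],
    [255, 0, 0], [255, 255, 255], [255, 255, 0],
], dtype=float)

def color_name_descriptor(pixels):
    """Probability distribution over color names for an (N, 3) RGB pixel array."""
    d = np.linalg.norm(pixels[:, None, :] - PROTOTYPES[None, :, :], axis=2)
    nearest = d.argmin(axis=1)  # hard-assign each pixel to a color term
    hist = np.bincount(nearest, minlength=len(COLOR_NAMES)).astype(float)
    return hist / hist.sum()
```

The resulting 11-bin distribution is far more compact than a raw color histogram, which is part of what makes it attractive as one local descriptor among several in a learned linear combination.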

    Multivalued Default Logic for Identity Maintenance in Visual Surveillance

    Recognition of complex activities from surveillance video requires detection and temporal ordering of its constituent "atomic" events. It also requires the capacity to robustly track individuals and maintain their identities across single as well as multiple camera views. Identity maintenance is a primary source of uncertainty for activity recognition and has been traditionally addressed via different appearance matching approaches. However, these approaches, by themselves, are inadequate. In this paper, we propose a prioritized, multivalued, default logic based framework that allows reasoning about the identities of individuals.
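The prioritization step can be sketched as: each default rule proposes an identity conclusion for a track, and a higher-priority rule overrides lower-priority contradicting ones. The rule sources and priority values below are invented for illustration:

```python
def resolve(conclusions):
    """For each tracked individual, keep the conclusion of the highest-priority
    applicable default; earlier conclusions win ties.

    conclusions: list of (priority, individual, identity) triples.
    """
    best = {}
    for priority, individual, identity in conclusions:
        if individual not in best or priority > best[individual][0]:
            best[individual] = (priority, identity)
    return {ind: ident for ind, (_, ident) in best.items()}

# Appearance matching (low priority) says track3 is "Alice", but a more
# reliable rule (high priority) says it is "Bob": the stronger default wins.
out = resolve([(1, "track3", "Alice"), (5, "track3", "Bob")])
```

This captures only the override behavior; the full framework also assigns multivalued truth degrees to each conclusion rather than a single winner.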